1. A personal comparative review. Reading this very interesting paper
prompted me to review the current literature on the Lasso and sparsity
and realize that there are at least three different sets of conditions on the
behavior of the predictors $X_1, \ldots, X_p$ under which the coefficients of the
sparsest representation of a general regression model are recovered with an $\ell_2$
error of order $\sqrt{\frac{s}{n}\log p}$, where $s$ is the dimension of the sparsest model. These
are, respectively, the conditions of this paper using the Dantzig selector and
those of Bunea, Tsybakov and Wegkamp [2] and Meinshausen and Yu [9]
using the Lasso. Strictly speaking, Bunea, Tsybakov and Wegkamp consider
only prediction, not $\ell_2$ loss, but in a paper in preparation with Ritov and
Tsybakov we show that the spirit of their conditions is applicable for $\ell_2$
loss as well. Since these authors emphasize different points and use different
normalizations, I thought it would be useful to present them together. Write
the model as

$$Y = X_{n \times p}\,\beta + \varepsilon,$$

where for simplicity we take $\varepsilon \sim N(0, \sigma^2 I_n)$, a case falling under all the
authors' conditions, and

$$X = (X_1, \ldots, X_p).$$

We begin by assuming that

$$X\beta \in [X_{i_1}, \ldots, X_{i_s}] = V,$$

an unknown unique $s$-dimensional linear subspace of $[X_1, \ldots, X_p]$, and that $X\beta$ lies in no
lower-dimensional subspace. For simplicity, we follow Knight and Fu [8] and
Meinshausen and Yu [9] and assume $|X_j|^2 = n$, $1 \le j \le p$, which puts the
problem on the same scale as the familiar case in which the rows of $X$ are $n$ i.i.d. $p$-vectors.
In its current form, a goal of all three sets of authors is to state conditions on $X$
under which, for the method of estimation they propose,

$$|\hat\beta - \beta^0| = O_P\left(\sqrt{\frac{s}{n}\log p}\right), \tag{1}$$

where $\beta^0$ corresponds to the "true" $\beta$. If $X_1, \ldots, X_p$ are orthogonal, this can be
done by thresholding, which is optimal; see the work of Donoho and
Johnstone [5].

In general, thresholding the least squares estimates of $\beta$ is incorrect, since
when $X^T X$ has rank $n < p$ the LSE are not uniquely defined. The
Dantzig selector and Lasso are computationally feasible ways of achieving
(1) under different sets of assumptions. In the orthogonal case they both
correspond to soft thresholding, as Candes and Tao [3] point out. All of
these are based on the two key geometric quantities introduced by Donoho
and Candes and Tao.
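
The orthogonal case can be checked numerically. The following is a minimal sketch, assuming the Lasso objective $|Y - X\beta|^2 + \lambda\sum_j |\beta_j|$ and an exactly orthogonal design with $|X_j|^2 = n$; under those assumptions the minimizer is the least squares estimate $X^T Y / n$ soft-thresholded at $\lambda/(2n)$. The function names, the simulated data and the value of `lam` are illustrative choices, not taken from any of the papers discussed.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries smaller than t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_orthogonal(X, y, lam):
    """Closed-form Lasso when X.T @ X = n * I (assumed here, not checked)."""
    n = X.shape[0]
    z = X.T @ y / n                      # least squares estimate in this case
    return soft_threshold(z, lam / (2.0 * n))

def lasso_objective(X, y, beta, lam):
    """Residual sum of squares plus lam times the l1 norm of beta."""
    resid = y - X @ beta
    return resid @ resid + lam * np.abs(beta).sum()

# Illustrative data: an orthogonal design scaled so that |X_j|^2 = n.
rng = np.random.default_rng(0)
n, p = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
beta0 = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta0 + 0.3 * rng.standard_normal(n)

lam = 20.0                               # hypothetical penalty level
beta_hat = lasso_orthogonal(X, y, lam)
print(beta_hat)
```

Because the objective is strictly convex when $X^T X = nI$, any perturbation of `beta_hat` should increase `lasso_objective`, which gives a simple way to check the closed form.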

Following Donoho [4] and Candes and Tao [3], define $\varphi_{\min}(m)$ to be the
minimal eigenvalue of the Gram matrix $X_L^T X_L$ over all $L$ with $|L| \le m$, and
$\varphi_{\max}$ to be the maximum eigenvalue of $X^T X$. Here, $X_L = \{(X_{i_1}, \ldots, X_{i_{|L|}}) : i_j \in L\}$.
Let

$$\theta_{m,m'} = \max\{|\langle X_L c_L, X_{L'} c_{L'}\rangle| : L \cap L' = \emptyset,\ |L| \le m',\ |L'| \le m,\ |c_L| \le 1,\ |c_{L'}| \le 1\},$$

where $c_L$ is $|L| \times 1$ and $c_{L'}$ is $|L'| \times 1$.
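
For small $p$ both quantities can be computed by brute force, which may help in comparing the conditions below on concrete designs. The sketch below assumes the Gram matrix is passed in already normalized (for example $G = X^T X / n$, an assumption suggested by the scaling $|X_j|^2 = n$). It uses two standard facts: by eigenvalue interlacing the minimum over $|L| \le m$ is attained at $|L| = m$, and the maximum over unit vectors $c_L$, $c_{L'}$ is the largest singular value of the off-diagonal block. The names `phi_min` and `theta` are illustrative.

```python
import itertools
import numpy as np

def phi_min(G, m):
    """Smallest eigenvalue of G[L, L] over index sets L with |L| <= m."""
    p = G.shape[0]
    m = min(m, p)
    # Interlacing: a smaller L never gives a smaller eigenvalue than a
    # size-m set containing it, so only |L| = m needs to be scanned.
    return min(np.linalg.eigvalsh(G[np.ix_(L, L)])[0]
               for L in itertools.combinations(range(p), m))

def theta(G, m, m_prime):
    """Largest spectral norm of G[L, L'] over disjoint L, L' of sizes m, m'."""
    p = G.shape[0]
    if m + m_prime > p:
        raise ValueError("need m + m_prime <= p for disjoint index sets")
    best = 0.0
    for L in itertools.combinations(range(p), m):
        rest = [j for j in range(p) if j not in L]
        for Lp in itertools.combinations(rest, m_prime):
            block = G[np.ix_(L, Lp)]
            # max |c_L' G[L, L'] c_L'| over unit vectors is the top singular value
            best = max(best, np.linalg.norm(block, 2))
    return best

# Illustrative Gram matrix: two correlated columns and one orthogonal column.
G = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(phi_min(G, 2), theta(G, 1, 2))
```

The cost grows combinatorially in $p$, so this is only a device for toy examples; the definition is symmetric in the two index sets, so the order of `m` and `m_prime` does not change the value.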
Denote the Lasso estimate by

$$\hat\beta_L = \arg\min_\beta \Big\{ |Y - X\beta|^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}.$$

Then the conditions common to both the Dantzig selector and the Lasso 
are: 

A1. $\varphi_{\max} \le \bar{k} < \infty$.

The unicity of sparsest representation condition:

A2. $\varphi_{\min}(2s) \ge \underline{k} > 0$.
The differences come in the condition which goes beyond A2, specifying
how "nearly orthogonal" $S$ and $\bar{S}$ are, where $\bar{S}$ is the complement of $S$.
The condition of Bunea, Tsybakov and Wegkamp [2] is:

A3 (BTW). $\rho_s = \max\{|\langle X_i, X_j\rangle| : i \in L,\ j \in \bar{L},\ |L| \le s\} \le \frac{M}{s}$ for a
sufficiently small constant $M$.

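That of Meinshausen and Yu [9] is a great strengthening of A2:

A3 (MY). $\varphi_{\min}(s \log n) \ge \epsilon$ for some $\epsilon > 0$.

The Candes and Tao [3] condition is, in these terms:

A3 (CT). $\theta_{s,2s} < \varphi_{\min}(2s) < 1$ and $\varphi_{\max}(2s) + \theta_{s,2s} < 2$,
with $\varphi_{\max}(m)$ the maximal-eigenvalue analogue of $\varphi_{\min}(m)$.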
That $\rho_s$ must decrease at rate $1/s$ is clearly meaningful. As $V$ gets bigger,
all columns not in V need to become more and more orthogonal to it. How 
are these conditions related? I hope that the other discussants will shed light 
on this. 

2. The form of the Dantzig predictor. There is a compelling reason for
using $\|X^T(Y - X\beta)\|_\infty$ rather than $\|Y - X\beta\|_\infty$ as part of the objective,
which the authors are undoubtedly aware of and hinted at in Section 1.3, but
did not make explicit. Suppose the $(\mathbf{X}_i, Y_i)$, $1 \le i \le n$, are i.i.d., where $\mathbf{X}_i$ is
$p$-dimensional, and we want to estimate $f(\mathbf{X}) = E(Y \mid \mathbf{X})$ using a dictionary
$\{f_j\}$, $j \ge 1$. Usually we imagine that $f(\mathbf{X}) = \sum_{j=1}^{\infty} \beta_j f_j(\mathbf{X})$, where the expansion
is in the $L_2$ sense, and we expect that $\sum_{j=1}^{p} \beta_j f_j(\mathbf{X})$ is a good approximation
in the $L_2$ sense. If we now identify $X_j$ with $(f_j(\mathbf{X}_1), \ldots, f_j(\mathbf{X}_n))^T$,
then the minimizer of $\|Y - X\beta\|_\infty$ is trying to minimize
$\operatorname{ess\,sup}_x |f(x) - \sum_{j=1}^{p} \beta_j f_j(x)|$, which can, no matter what $p$ is, be very large. On the other
hand, the Dantzig predictor needs to match correctly all $\int f(x) f_j(x)\,dP(x)$,
$1 \le j \le p$, which is just what is wanted.
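
The matching step can be spelled out as follows. This is only a sketch: the $1/n$ normalization and the appeal to the law of large numbers are assumptions added here, not something stated in the paper under discussion. Under the identification of $X_j$ with the values of $f_j$ at the observations, the $j$th coordinate of the Dantzig constraint is an empirical inner product,

$$\frac{1}{n} X_j^T (Y - X\beta) = \frac{1}{n}\sum_{i=1}^{n} f_j(\mathbf{X}_i)\Big(Y_i - \sum_{k=1}^{p} \beta_k f_k(\mathbf{X}_i)\Big) \approx \int f_j(x)\Big(f(x) - \sum_{k=1}^{p} \beta_k f_k(x)\Big)\,dP(x),$$

so keeping $\|X^T(Y - X\beta)\|_\infty$ small asks that the residual $f - \sum_k \beta_k f_k$ be nearly orthogonal in $L_2(P)$ to each of $f_1, \ldots, f_p$, rather than uniformly small.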

3. Model selection. As Candes and Tao [3] point out, from a statistical 
point of view the Dantzig selector, just as the Lasso, can be viewed as a 
method of model selection. Of course, the $\ell_2$ norm results per se do not
say which variables are the ones appearing in the sparsest model. But as 
they, and more explicitly, Meinshausen and Yu [9] point out, model selection 
after computation of the estimate (Dantzig selector or Lasso, resp.) can 
give an idea of what are the variables with large true coefficients; see also 
Wainwright [10] and Zhao and Yu [11]. I would argue that, in the large 
p, large n situations being considered, there may well be variables which 
are interpretable and, though necessarily highly correlated with variables 
which do appear in the sparsest model, themselves do not appear. Or, even 
worse, condition A2 may not hold. There may be two spaces S and S' of the 
same dimension s, each of which provides sparsest representations. Both the 
Dantzig selector and the Lasso in a case like this will, depending on starting 
point, converge to a point close to one of the representations — but in fact 
we would like to consider the whole space of such representations. 

Here is a simple example with $s = 2$. For simplicity, we consider the noiseless
case and make the rows of $X$ i.i.d. so that we can talk in
population terms. We assume $X_1$, $X_2$, $X_3$ are predictors with mean 0 and
variance 1. $X_1$ and $X_3$ are independent and

$$X_2 = \alpha X_1 + \beta X_3.$$

Suppose, also, $Y = X_1 + X_2 + X_3$. Then there are three representations of
$Y$, corresponding to $(X_1, X_2)$, $(X_2, X_3)$ and $(X_1, X_3)$, and a Lasso fit might
converge to any of the three. If the number of predictors is small, exact
collinearities or near collinearities can be identified, but this is not the case
when $p$ is large.

Suppose though that we can identify collinearities as above. Is there a 
reasonable measure of importance of a variable? In the above example there 
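are two representations involving each variable. For instance, for $X_2$,

$$Y = \Big(1 + \frac{1}{\beta}\Big) X_2 + \Big(1 - \frac{\alpha}{\beta}\Big) X_1, \qquad Y = \Big(1 + \frac{1}{\alpha}\Big) X_2 + \Big(1 - \frac{\beta}{\alpha}\Big) X_3.$$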

Rewrite the models so as to take out the effect of the other variable on
$X_2$:

$$Y = \Big(1 + \frac{1}{\beta}\Big)(X_2 - \alpha X_1) + (\alpha + 1) X_1,$$

$$Y = \Big(1 + \frac{1}{\alpha}\Big)(X_2 - \beta X_3) + (\beta + 1) X_3.$$
A natural measure of the importance of $X_2$ in the presence of $X_1$ is then

$$SN^2(2|1) \equiv \Big(1 + \frac{1}{\beta}\Big)^2 \beta^2 = (\beta + 1)^2,$$

and similarly,

$$SN^2(2|3) = (\alpha + 1)^2.$$

It seems reasonable that we ascribe to $X_2$ overall importance as

$$I(2) = \max\{(\alpha + 1)^2, (\beta + 1)^2\}.$$
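
These population formulas can be reproduced on simulated data. The sketch below is an illustration under stated assumptions: it builds $X_1$ and $X_3$ to be exactly centered, uncorrelated and of unit variance in the sample (so that sample and population quantities coincide), picks $\alpha = 0.6$, $\beta = 0.8$ so that $X_2$ has variance 1, and computes each $SN^2$ as the squared least squares coefficient of $X_2$ times the mean square of $X_2$ after removing the other variable. The helper name `sn2` is hypothetical.

```python
import numpy as np

def sn2(y, x_target, x_other):
    """SN^2(target | other): squared coefficient times residual mean square."""
    design = np.column_stack([x_other, x_target])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Take out of the target its projection on the other variable.
    resid = x_target - x_other * (x_other @ x_target) / (x_other @ x_other)
    return coef[1] ** 2 * np.mean(resid ** 2)

rng = np.random.default_rng(1)
n, alpha, beta = 500, 0.6, 0.8           # alpha^2 + beta^2 = 1 keeps Var(X2) = 1
Z = rng.standard_normal((n, 2))
Z -= Z.mean(axis=0)                      # center so that no intercept is needed
Q, _ = np.linalg.qr(Z)
X1 = np.sqrt(n) * Q[:, 0]                # mean 0, variance 1 in the sample
X3 = np.sqrt(n) * Q[:, 1]                # exactly uncorrelated with X1
X2 = alpha * X1 + beta * X3
Y = X1 + X2 + X3                         # noiseless case

sn2_given_1 = sn2(Y, X2, X1)             # should agree with (beta + 1)^2
sn2_given_3 = sn2(Y, X2, X3)             # should agree with (alpha + 1)^2
importance = max(sn2_given_1, sn2_given_3)
print(sn2_given_1, sn2_given_3, importance)
```

With these values the two quantities come out at about 3.24 and 2.56, so the overall importance of $X_2$ is driven by its role alongside $X_1$.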
If we add independent noise $\varepsilon_i$ with variance $\sigma^2$ to each observation,

$$Y_i = X_{1i} + X_{2i} + X_{3i} + \varepsilon_i,$$

then the natural definition of importance is

$$I(2) = \max\left\{\frac{SN^2(2|1)}{\sigma^2}, \frac{SN^2(2|3)}{\sigma^2}\right\}.$$



If we estimate the coefficients of $X_2$ in each of the two models, we see that,
up to a constant, $I(2)$ is just the maximum of the squares of the $t$ statistics
for testing the hypothesis that the coefficient of $X_2$ is 0 in each of the two
models.

How should we address the question of determining "importance" of a
predictor? What we need first is the linear space spanned by $\{X_j : j \in V\}$,
obtained by removing all vectors $X_j$ uncorrelated with $X\beta$. This can be done easily by using
the test statistic and a cut-off of ..., where $\hat\sigma^2$ is the residual
mean square after a best predictive Lasso fit. This approach is worked out
to eliminate more than just variables uncorrelated with $X\beta$ by Fan and Lv
[6], who focus on situations where $p > n$. Once we have $V$, we would like
to assess the importance of our variables. If $|V| = s$, we are done using the
usual least squares fit. However, we would typically have $|V| \gg s$. In that
case, since every variable can enter into a representation of $X\beta$, we need to
apply the Dantzig selector or the Lasso to reduce dimension. Unfortunately,
unless all representations have $|\beta|_1$ which differ by at most $o_p(\sqrt{\log p/n})$,
both the Dantzig selector and the Lasso will converge to a minimum $|\beta|_1$
solution. It is not clear to me how to obtain alternative representations
when there are many variables without considering all possible subspaces,
which is hopeless. However, it seems plausible that a first approximation
to "importance" might be obtained as follows. Select a set of $K$ candidate
variables, say, the $K$ most correlated with $Y$, whether they appear or not in
the Lasso or Dantzig fit. Regress the fitted values obtained by Dantzig or Lasso
on each of these together with all but one of the $N$ variables having nonzero
coefficients in the Dantzig or Lasso fit. Retain only candidate variables
such that, in at least one of the $N$ resulting LS fits, the fit is good and the
importance of the candidate in that context is large.
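
One possible reading of this screening step is sketched below. Everything specific in it is an assumption made for illustration: the fitted values and the support of the Dantzig or Lasso fit are taken as given inputs, "good fit" is read as $R^2 \ge$ `r2_min`, "large importance" as the squared coefficient times the residual mean square of the candidate being at least `imp_min`, and the columns are assumed centered so that no intercept is fitted. The function name and both thresholds are hypothetical.

```python
import numpy as np

def screen_candidates(X, y, fitted, support, K, r2_min=0.99, imp_min=1.0):
    """Keep candidates that matter in at least one good leave-one-out LS fit."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    candidates = np.argsort(-corr)[:K]           # the K columns most correlated with y
    tss = np.sum((fitted - fitted.mean()) ** 2)
    kept = []
    for c in candidates:
        for j in support:
            # Drop support variable j, add the candidate as the last column.
            cols = [k for k in support if k != j and k != c] + [int(c)]
            D = X[:, cols]
            coef, *_ = np.linalg.lstsq(D, fitted, rcond=None)
            r2 = 1.0 - np.sum((fitted - D @ coef) ** 2) / tss
            others = D[:, :-1]
            if others.shape[1] > 0:
                proj, *_ = np.linalg.lstsq(others, X[:, c], rcond=None)
                resid = X[:, c] - others @ proj  # candidate with the others taken out
            else:
                resid = X[:, c]
            importance = coef[-1] ** 2 * np.mean(resid ** 2)
            if r2 >= r2_min and importance >= imp_min:
                kept.append(int(c))
                break
    return kept

# Illustrative data: the collinear example (columns 0, 1, 2) plus two noise columns.
rng = np.random.default_rng(3)
n, alpha, beta = 300, 0.6, 0.8
Z = rng.standard_normal((n, 2))
Z -= Z.mean(axis=0)
Q, _ = np.linalg.qr(Z)
X1, X3 = np.sqrt(n) * Q[:, 0], np.sqrt(n) * Q[:, 1]
X2 = alpha * X1 + beta * X3
noise = rng.standard_normal((n, 2))
noise -= noise.mean(axis=0)
X = np.column_stack([X1, X2, X3, noise])
y = X1 + X2 + X3
fitted = y.copy()                                # stands in for a Lasso or Dantzig fit
kept = screen_candidates(X, y, fitted, support=[0, 2], K=4)
print(sorted(kept))
```

In this toy run the fit is assumed to have selected $(X_1, X_3)$; the sketch then also recovers $X_2$, which never appears in that fit, while the pure noise candidate is dropped because no fit containing it is good.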

The properties of such procedures under the circumstances I outline, many 
variables and collinearity, seem worth considering. Of course, not all nonzero 
coefficient variables in the Dantzig or Lasso fits necessarily have to be con- 
sidered. Again one could limit to a candidate set having large coefficients. 

4. Choice of $\lambda$ and $\sigma$. Candes and Tao [3] do not dwell on the choice of
$\lambda$, which should be of order $\sigma\sqrt{2\log p/n}$, but, in Section 4, simply specify the
empirical maximum over several realizations of $|X^T Z|_\infty$, or in our case $|X^T Z|_\infty / n$,
with $Z \sim N(0, I_n)$. If the goal is to minimize the $\ell_2$ norm $|\hat\beta_{CT} - \beta|$, where
$\hat\beta_{CT}$ is their estimate, it is worth pointing out that $V$-fold cross-validation
can be used here as well. That is, choose a test subsample of size, say,
$\log n$ and fit the Dantzig predictor to the remaining $n - \log n$ observations,
say, for a grid of $\lambda$ values of width $c\sqrt{\log p/n}$. Then choose the best $\lambda$ in the
sense of $\ell_2$ prediction for the test sample. This should give optimal rates
if $\log n = o(\log p)$, the typical case. If all elements are bounded in absolute
value, this follows from Theorem 6 of Bickel, Ritov and Zakai [1]; see also
Gyorfi, Kohler, Krzyzak and Walk [7]. Strictly speaking, this optimizes for
prediction loss, not $\ell_2$, but for $\beta_S$ the two are equivalent.
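
The hold-out step can be sketched in a few lines. Note the substitution: to stay self-contained, the sketch fits the Lasso by coordinate descent rather than the Dantzig selector, which would need a linear programming solver; the selection logic is the same for either estimator. The grid, the simulated data, the test-sample size and the function names are all illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for |y - X b|^2 + lam * sum_j |b_j| (stand-in solver)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = y.copy()                          # always equals y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ resid + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid += X[:, j] * (b[j] - new)   # keep the residual in sync
            b[j] = new
    return b

def choose_lambda(X, y, grid, n_test, rng):
    """Return the grid value with the smallest squared error on a held-out sample."""
    idx = rng.permutation(len(y))
    test, train = idx[:n_test], idx[n_test:]
    errors = []
    for lam in grid:
        b = lasso_cd(X[train], y[train], lam)
        errors.append(np.mean((y[test] - X[test] @ b) ** 2))
    return grid[int(np.argmin(errors))], errors

# Illustrative data with a sparse coefficient vector.
rng = np.random.default_rng(2)
n, p, sigma = 200, 30, 1.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [3.0, -2.0, 1.5]
y = X @ beta0 + sigma * rng.standard_normal(n)

# Hypothetical grid: multiples of a penalty level on the scale of this objective.
grid = 2.0 * sigma * np.sqrt(2.0 * n * np.log(p)) * np.array([0.25, 0.5, 1.0, 2.0])
n_test = int(np.ceil(np.log(n)))              # test subsample of size about log n
lam_best, errors = choose_lambda(X, y, grid, n_test, rng)
print(lam_best, errors)
```

With a test sample as small as $\log n$ the selected value varies noticeably from split to split, so in practice one would probably repeat the split or average over folds; `n_test` is left as a parameter for that reason.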